
add check functions #276

Merged · dougbrn merged 7 commits into main from check_functions on Oct 25, 2023
Conversation

@dougbrn dougbrn (Collaborator) commented Oct 23, 2023

This PR addresses #275, adding two functions (check_sorted and check_lightcurve_cohesion) that assess whether the ensemble data is sorted and whether the source data is co-located within single partitions (referring to this as "cohesion").

There's a bit of a snowball effect to implementing these, so this PR also ends up addressing #242 by adding a sort flag to the loader calls that defaults to False. This allows the Ensemble to actually be unsorted, which makes the check_sorted function useful. The sort flag was added to Dask in 2023.6.1, so I've bumped the minimum version in pyproject.toml accordingly. That change broke something in ensemble.insert_sources, which needed a small tweak to the ColumnMapper; sorting also needs to be enabled for insert_sources and potentially for any upstream data loading. This may be confusing to users, but I see this function as mainly for devs, so it's probably unlikely that users will need it much.
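For orientation, a minimal usage sketch of what this PR adds (the loader call and its arguments are illustrative assumptions; the two check functions and the sort flag are the actual additions):

from tape import Ensemble

ens = Ensemble()
# sort now defaults to False, so loaded data is no longer implicitly sorted
ens.from_parquet("source_data.parquet", sort=False)  # loader args illustrative

ens.check_sorted(table="source")   # True if the source table's index is sorted
ens.check_lightcurve_cohesion()    # True if each lightcurve sits in one partition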

@codecov codecov bot commented Oct 23, 2023

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison: base (1e5c20b) 0.00% → head (66203f9) 93.70%.

Additional details and impacted files
@@            Coverage Diff            @@
##           main     #276       +/-   ##
=========================================
+ Coverage      0   93.70%   +93.70%     
=========================================
  Files         0       23       +23     
  Lines         0     1192     +1192     
=========================================
+ Hits          0     1117     +1117     
- Misses        0       75       +75     
Files Coverage Δ
src/tape/ensemble.py 91.88% <90.90%> (ø)

... and 22 files with indirect coverage changes


@dougbrn dougbrn (Collaborator, Author) commented Oct 23, 2023

This PR adds 2 lines that are uncovered by unit tests; they form a conditional that checks for, and raises an error on, anything other than "object" or "source" being supplied to check_sorted. I would argue this is fine, as these conditionals are about to go away across the Ensemble once the refactor goes in.

@dougbrn dougbrn marked this pull request as ready for review October 23, 2023 23:08
@dougbrn dougbrn requested a review from jeremykubica October 24, 2023 16:41
    idx = self._source.index
else:
    raise ValueError(f"{table} is not one of 'object' or 'source'")
# Per partition, compare each element with its successor, element-wise
return idx.map_partitions(lambda a: np.all(a[:-1] <= a[1:])).compute().all()
Contributor
It is not obvious to me how this works. Doesn't map_partitions call the function (in this case the lambda) once per partition? If so, it is only checking that the last element is smaller or equal to the first one. Wouldn't that leave out the elements in the middle?

From the tests this looks like it works, so I must be missing something. Is there a comment you could add to provide more info?
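For intuition, a toy demonstration (not from the PR): the lambda receives a whole partition, and a[:-1] <= a[1:] compares every adjacent pair within it, not just the endpoints:

import numpy as np
import pandas as pd

a = pd.Index([1, 3, 2, 4])  # stand-in for one partition's index
# Element-wise comparison of each value with its successor:
# [1<=3, 3<=2, 2<=4] -> [True, False, True]
print(np.all(a[:-1] <= a[1:]))  # False: this partition is not sorted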

Collaborator Author
Thanks for calling this out, it's a vectorized operation that's actually doing element-wise comparison within the partition. As I was trying to put together an explanation, I realized that this is only doing comparison within the partition, and not across partitions. I've swapped this out to just use the existing is_monotonic_increasing function within the Dask/Pandas Index API, and expanded the test case to test the multi-partition case.
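A minimal sketch of the revised approach (is_monotonic_increasing is the Dask/Pandas Index API mentioned above; the function wrapper and example data are my own):

import dask.dataframe as dd
import pandas as pd

def check_sorted_sketch(ddf):
    # is_monotonic_increasing on a Dask Index is a lazy reduction that
    # also accounts for ordering across partition boundaries
    return ddf.index.is_monotonic_increasing.compute()

df = pd.DataFrame({"flux": range(6)}, index=[0, 1, 2, 5, 4, 3])
print(check_sorted_sketch(dd.from_pandas(df, npartitions=2, sort=False)))  # False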

# Tally the unique index values (ids) present in each partition
counts = idx.map_partitions(lambda a: Counter(a.unique())).compute()

unq_counter = counts[0]
for i in range(len(counts) - 1):
Contributor
might be clearer as range(1, len(counts)) and then unq_counter += counts[i]

Collaborator Author
Added
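Putting the thread together, a sketch of the cohesion check with the suggestion applied (the function wrapper and the intersection test are my assumptions; the Counter tally and the range(1, len(counts)) loop come from the snippets above):

from collections import Counter

def check_cohesion_sketch(ddf):
    # Tally which ids appear in each partition
    counts = ddf.index.map_partitions(lambda a: Counter(a.unique())).compute()
    unq_counter = counts[0]
    for i in range(1, len(counts)):  # the loop form suggested above
        # An id already seen in an earlier partition means a lightcurve
        # is split across partitions, so the data is not cohesive
        if len(unq_counter & counts[i]) > 0:
            return False
        unq_counter += counts[i]
    return True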

@dougbrn dougbrn requested a review from jeremykubica October 24, 2023 21:47
@dougbrn dougbrn merged commit 4243b85 into main Oct 25, 2023
9 checks passed
@dougbrn dougbrn deleted the check_functions branch December 11, 2023 19:27